Search Results
PR-314: VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio, and Text
Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text | NeurIPS 2021
VATT Paper Review (Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text)
Data2vec: A general framework for self-supervised learning in speech, vision and language
Multi-Modal Self-Supervised Learning from Videos
cereproc Capture 15 Text To speech!
DCASE Workshop 2021, ID 70 - Transfer Learning followed by Transformer for Automated Audio Captio...
Stanford CS25: V1 I Audio Research: Transformers for Applications in Audio, Speech, Music
Transformer is All You Need - Multimodal Multitask Learning with a Unified Transformer
Relaxing Contrastiveness in Multimodal Representation Learning
RS-024: data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
PR-315: Taming Transformers for High-Resolution Image Synthesis